New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add check for malloc failure in libpq calls #5446
Conversation
Codecov Report
@@ Coverage Diff @@
## main #5446 +/- ##
==========================================
- Coverage 90.72% 90.68% -0.05%
==========================================
Files 228 228
Lines 53093 53089 -4
==========================================
- Hits 48171 48145 -26
- Misses 4922 4944 +22
... and 6 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
1e2cddd
to
09def5e
Compare
tsl/src/remote/connection.c
Outdated
/* probably OOM */ | ||
elog(ERROR, "could not get default libpq options"); | ||
} | ||
Ensure(libpq_options, "out of memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be the most common error, as an improvement I suggest to make a new macro like Ensure_oom
(or something like that) using ERRCODE_OUT_OF_MEMORY
code underneath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Makes more sense than using ERRCODE_INTERNAL_ERROR
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just ereport
? Ensure is for internal program errors, and OOM is an environment error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just
ereport
? Ensure is for internal program errors, and OOM is an environment error.
I think it is a matter of definitions. IMHO, Ensure
is used when environmental issues can cause problems as well, such as bad catalog data.
No strong opinion on what to use here, but if using ERRCODE_OUT_OF_MEMORY
it is simpler to replace with ereport
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this is an improvement over the previous error. The pointer was already checked for NULL and the new error message is more generic than the previous one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this is an improvement over the previous error. The pointer was already checked for NULL and the new error message is more generic than the previous one.
But AFAICT, it only returns NULL if it fails to allocate memory. I have no strong opinion, I can replace it with the original code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a matter of definitions.
Yes, it would be good to work out some guidelines for this. Being able to distinguish program logic errors is important for fuzzing, like SQLSmith, or some kinds of stress testing. Things like Assert
and elog
are such errors, if you see them under any conditions, it's a bug. We could make SQLSmith more efficient by adding some handling for ERRCODE_INTERNAL_ERROR
errors, and using elog
vs ereport
in a more principled manner. In the context of testing, catalog corruption is probably also an internal error? Not sure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of testing, catalog corruption is probably also an internal error? Not sure...
Oh, I think it is, but it can be caused by factors out of our control, so we should have checks for it and just not assume that it is always correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a macro TS_OOM_CHECK
and used that instead. It calls ereport
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is doing the right thing.
We should have these checks where we allocate the memory, and not where we use or pass the pointers around.
We also want to try to avoid throwing errors in the networking/connection code (there is actually an effort to get away from that), because errors fail the current transaction, which is not what you always want to do (although local memory pressure might be an exception). For example, if we have a function that runs a health check on all DN connections it will fail the healthcheck on all connections if only one of them experiences an error. Therefore, we should pass the error up to higher layers, which can then decide if throwing an error is appropriate.
tsl/src/remote/connection.c
Outdated
@@ -397,7 +398,7 @@ handle_conn_destroy(PGEventConnDestroy *event) | |||
unsigned int results_count = 0; | |||
ListNode *curr; | |||
|
|||
Assert(NULL != conn); | |||
Ensure(conn, "no instance data available"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really should be an Assert()
because it checks something that shouldn't happen. PQinstanceData just literally gets a pointer from an array and will only fail if it is not found or a passed in pointer is NULL.
The check should be elsewhere, e.g., where the allocation of the instance data occurs. If we follow this pattern, we will just have unused error messages every time we reference a pointer. Instead, this error is just masking the real issue: that a connection wasn't properly allocated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really should be an
Assert()
because it checks something that shouldn't happen. PQinstanceData just literally gets a pointer from an array and will only fail if it is not found or a passed in pointer is NULL.
Yes, and the point is that it is not entirely clear if we can have a situation where the pointer is not found. If we have, the server will crash in production if we only use an Assert. This is the idea of being defensive and rather error out than crash if we make a mistake.
The check should be elsewhere, e.g., where the allocation of the instance data occurs. If we follow this pattern, we will just have unused error messages every time we reference a pointer.
Not sure I follow you. For this case we will have an error message instead of a crash; not every time we reference a pointer. Yes, it will mostly not trigger, but that is just code so I do not see the problem.
Anyway, I'll remove these defensive checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restored the Assert.
tsl/src/remote/connection.c
Outdated
@@ -441,7 +442,7 @@ handle_result_create(PGEventResultCreate *event) | |||
TSConnection *conn = PQinstanceData(event->conn, eventproc); | |||
ResultEntry *entry; | |||
|
|||
Assert(NULL != conn); | |||
Ensure(conn, "no instance data available"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, this should be an Assert()
. The connection is allocated elsewhere and that's where we should have the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, we replace a crash in production builds with an error message. Not every time we reference the instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restored the Assert.
tsl/src/remote/connection.c
Outdated
@@ -452,7 +453,10 @@ handle_result_create(PGEventResultCreate *event) | |||
entry->result = event->result; | |||
/* Add entry as new head and set instance data */ | |||
list_insert_after(&entry->ln, &conn->results); | |||
PQresultSetInstanceData(event->result, eventproc, entry); | |||
if (!PQresultSetInstanceData(event->result, eventproc, entry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only fail if you pass in bad pointers. We should have the checks where those pointers are set/allocated instead.
If this is really what we want to do, it would be better to wrap these functions in, e.g., a MACRO, and raise the error in the wrapper instead of doing it in every place they are called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is the only place where we use that function, so it seems pointless to add a macro for that reason alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this check.
tsl/src/remote/connection.c
Outdated
/* probably OOM */ | ||
elog(ERROR, "could not get default libpq options"); | ||
} | ||
Ensure(libpq_options, "out of memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this is an improvement over the previous error. The pointer was already checked for NULL and the new error message is more generic than the previous one.
tsl/src/remote/connection.c
Outdated
@@ -1024,6 +1026,7 @@ remote_connection_get_result(const TSConnection *conn, TimestampTz endtime) | |||
if (PQconsumeInput(conn->pg_conn) == 0) | |||
{ | |||
pgres = PQmakeEmptyPGresult(conn->pg_conn, PGRES_FATAL_ERROR); | |||
Ensure(pgres, "out of memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check makes sense, as opposed to most of the other ones, since this one actually allocates memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be more defensive in our coding practices. The referenced bug demonstrates quite clearly that this is something we need to improve on.
I'll remove the defensive checks around the instance handling though.
tsl/src/remote/connection.c
Outdated
@@ -397,7 +398,7 @@ handle_conn_destroy(PGEventConnDestroy *event) | |||
unsigned int results_count = 0; | |||
ListNode *curr; | |||
|
|||
Assert(NULL != conn); | |||
Ensure(conn, "no instance data available"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really should be an
Assert()
because it checks something that shouldn't happen. PQinstanceData just literally gets a pointer from an array and will only fail if it is not found or a passed in pointer is NULL.
Yes, and the point is that it is not entirely clear if we can have a situation where the pointer is not found. If we have, the server will crash in production if we only use an Assert. This is the idea of being defensive and rather error out than crash if we make a mistake.
The check should be elsewhere, e.g., where the allocation of the instance data occurs. If we follow this pattern, we will just have unused error messages every time we reference a pointer.
Not sure I follow you. For this case we will have an error message instead of a crash; not every time we reference a pointer. Yes, it will mostly not trigger, but that is just code so I do not see the problem.
Anyway, I'll remove these defensive checks.
tsl/src/remote/connection.c
Outdated
@@ -441,7 +442,7 @@ handle_result_create(PGEventResultCreate *event) | |||
TSConnection *conn = PQinstanceData(event->conn, eventproc); | |||
ResultEntry *entry; | |||
|
|||
Assert(NULL != conn); | |||
Ensure(conn, "no instance data available"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, we replace a crash in production builds with an error message. Not every time we reference the instance.
tsl/src/remote/connection.c
Outdated
/* probably OOM */ | ||
elog(ERROR, "could not get default libpq options"); | ||
} | ||
Ensure(libpq_options, "out of memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this is an improvement over the previous error. The pointer was already checked for NULL and the new error message is more generic than the previous one.
But AFAICT, it only returns NULL if it fails to allocate memory. I have no strong opinion, I can replace it with the original code.
tsl/src/remote/connection.c
Outdated
@@ -452,7 +453,10 @@ handle_result_create(PGEventResultCreate *event) | |||
entry->result = event->result; | |||
/* Add entry as new head and set instance data */ | |||
list_insert_after(&entry->ln, &conn->results); | |||
PQresultSetInstanceData(event->result, eventproc, entry); | |||
if (!PQresultSetInstanceData(event->result, eventproc, entry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is the only place where we use that function, so it seems pointless to add a macro for that reason alone.
tsl/src/remote/connection.c
Outdated
/* probably OOM */ | ||
elog(ERROR, "could not get default libpq options"); | ||
} | ||
Ensure(libpq_options, "out of memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of testing, catalog corruption is probably also an internal error? Not sure...
Oh, I think it is, but it can be caused by factors out of our control, so we should have checks for it and just not assume that it is always correct.
But all the instances that were added are throwing an error instead of crashing, and it is not trivial to capture a crashing process and deal with it elsewhere in the code. |
The functions `PQconndefaults` and `PQmakeEmptyPGresult` calls `malloc` and can return NULL if it fails to allocate memory for the defaults and the empty result. It is checked with an `Assert`, but this will be removed in production builds. Replace the `Assert` with an checks to generate an error in production builds rather than trying to de-reference the pointer and cause a crash.
## 2.10.2 (2023-04-20) **Bugfixes** * timescale#5410 Fix file trailer handling in the COPY fetcher * timescale#5446 Add checks for malloc failure in libpq calls * timescale#5233 Out of on_proc_exit slots on guc license change * timescale#5428 Use consistent snapshots when scanning metadata * timescale#5499 Do not segfault on large histogram() parameters * timescale#5470 Ensure superuser perms during copy/move chunk * timescale#5500 Fix when no FROM clause in continuous aggregate definition * timescale#5433 Fix join rte in CAggs with joins * timescale#5556 Fix duplicated entries on timescaledb_experimental.policies view * timescale#5462 Fix segfault after column drop on compressed table * timescale#5543 Copy scheduled_jobs list before sorting it * timescale#5497 Allow named time_bucket arguments in Cagg definition * timescale#5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * timescale#5558 Use regrole for job owner * timescale#5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
## 2.10.2 (2023-04-20) **Bugfixes** * #5410 Fix file trailer handling in the COPY fetcher * #5446 Add checks for malloc failure in libpq calls * #5233 Out of on_proc_exit slots on guc license change * #5428 Use consistent snapshots when scanning metadata * #5499 Do not segfault on large histogram() parameters * #5470 Ensure superuser perms during copy/move chunk * #5500 Fix when no FROM clause in continuous aggregate definition * #5433 Fix join rte in CAggs with joins * #5556 Fix duplicated entries on timescaledb_experimental.policies view * #5462 Fix segfault after column drop on compressed table * #5543 Copy scheduled_jobs list before sorting it * #5497 Allow named time_bucket arguments in Cagg definition * #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * #5558 Use regrole for job owner * #5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
## 2.10.2 (2023-04-20) **Bugfixes** * #5410 Fix file trailer handling in the COPY fetcher * #5446 Add checks for malloc failure in libpq calls * #5233 Out of on_proc_exit slots on guc license change * #5428 Use consistent snapshots when scanning metadata * #5499 Do not segfault on large histogram() parameters * #5470 Ensure superuser perms during copy/move chunk * #5500 Fix when no FROM clause in continuous aggregate definition * #5433 Fix join rte in CAggs with joins * #5556 Fix duplicated entries on timescaledb_experimental.policies view * #5462 Fix segfault after column drop on compressed table * #5543 Copy scheduled_jobs list before sorting it * #5497 Allow named time_bucket arguments in Cagg definition * #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * #5558 Use regrole for job owner * #5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
## 2.10.2 (2023-04-20) **Bugfixes** * #5410 Fix file trailer handling in the COPY fetcher * #5446 Add checks for malloc failure in libpq calls * #5233 Out of on_proc_exit slots on guc license change * #5428 Use consistent snapshots when scanning metadata * #5499 Do not segfault on large histogram() parameters * #5470 Ensure superuser perms during copy/move chunk * #5500 Fix when no FROM clause in continuous aggregate definition * #5433 Fix join rte in CAggs with joins * #5556 Fix duplicated entries on timescaledb_experimental.policies view * #5462 Fix segfault after column drop on compressed table * #5543 Copy scheduled_jobs list before sorting it * #5497 Allow named time_bucket arguments in Cagg definition * #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * #5558 Use regrole for job owner * #5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
## 2.10.2 (2023-04-20) **Bugfixes** * timescale#5410 Fix file trailer handling in the COPY fetcher * timescale#5446 Add checks for malloc failure in libpq calls * timescale#5233 Out of on_proc_exit slots on guc license change * timescale#5428 Use consistent snapshots when scanning metadata * timescale#5499 Do not segfault on large histogram() parameters * timescale#5470 Ensure superuser perms during copy/move chunk * timescale#5500 Fix when no FROM clause in continuous aggregate definition * timescale#5433 Fix join rte in CAggs with joins * timescale#5556 Fix duplicated entries on timescaledb_experimental.policies view * timescale#5462 Fix segfault after column drop on compressed table * timescale#5543 Copy scheduled_jobs list before sorting it * timescale#5497 Allow named time_bucket arguments in Cagg definition * timescale#5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * timescale#5558 Use regrole for job owner * timescale#5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
## 2.10.2 (2023-04-20) **Bugfixes** * #5410 Fix file trailer handling in the COPY fetcher * #5446 Add checks for malloc failure in libpq calls * #5233 Out of on_proc_exit slots on guc license change * #5428 Use consistent snapshots when scanning metadata * #5499 Do not segfault on large histogram() parameters * #5470 Ensure superuser perms during copy/move chunk * #5500 Fix when no FROM clause in continuous aggregate definition * #5433 Fix join rte in CAggs with joins * #5556 Fix duplicated entries on timescaledb_experimental.policies view * #5462 Fix segfault after column drop on compressed table * #5543 Copy scheduled_jobs list before sorting it * #5497 Allow named time_bucket arguments in Cagg definition * #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * #5558 Use regrole for job owner * #5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
## 2.10.2 (2023-04-20) **Bugfixes** * #5410 Fix file trailer handling in the COPY fetcher * #5446 Add checks for malloc failure in libpq calls * #5233 Out of on_proc_exit slots on guc license change * #5428 Use consistent snapshots when scanning metadata * #5499 Do not segfault on large histogram() parameters * #5470 Ensure superuser perms during copy/move chunk * #5500 Fix when no FROM clause in continuous aggregate definition * #5433 Fix join rte in CAggs with joins * #5556 Fix duplicated entries on timescaledb_experimental.policies view * #5462 Fix segfault after column drop on compressed table * #5543 Copy scheduled_jobs list before sorting it * #5497 Allow named time_bucket arguments in Cagg definition * #5544 Fix refresh from beginning of Continuous Aggregate with variable time bucket * #5558 Use regrole for job owner * #5542 Enable indexscan on uncompressed part of partially compressed chunks **Thanks** * @nikolaps for reporting an issue with the COPY fetcher * @S-imo-n for reporting the issue on Background Worker Scheduler crash * @geezhu for reporting issue on segfault in historgram() * @mwahlhuetter for reporting the issue with joins in CAggs * @mwahlhuetter for reporting issue with duplicated entries on timescaledb_experimental.policies view * @H25E for reporting error refreshing from beginning of a Continuous Aggregate with variable time bucket
The functions
PQconndefaults
andPQmakeEmptyPGresult
callsmalloc
and can return NULL if it fails to allocate memory for thedefaults and the empty result. It is checked with an
Assert
, but thiswill be removed in production builds.
Replace the
Assert
with an checks to generate an error in productionbuilds rather than trying to de-reference the pointer and cause a
crash.
Fixes #5447